---
title: "Explainer: FUN with Str_Detect!"
author: "nathan collins"
date: "4/27/2022"
output: 
  html_document:
    toc: true
    toc_float: true
    df_print: paged
    code_download: true
    code_folding: hide
    theme: cosmo
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
    echo = TRUE,
    message = TRUE,
    warning = TRUE)

library(tidyverse)
library(janitor)
library(lubridate)
library(reactable)

```

## Str_Detect

## The miracle of text detection with the wonders of THE Tidyverse!

Hello! This RMarkdown is meant to explain what the tidyverse function "str_detect" is all about and the ways you can incorporate it into your data analysis and journalistic inquiries!

So first up, we are going to load in a data set. This one happens to be from the Armed Conflict Location & Event Data Project (ACLED), which has been tracking protests and armed conflict globally since as early as 2014 and makes the data available for public use. The data set is global, covers January 2020 to April 2022, and was downloaded using the ACLED's data download function. It was then read into this markdown as a csv file and named acled_data.

Below is the code chunk for loading in the csv file downloaded from the ACLED website!

```{r}
acled_data <-
  read_csv( "acled_protests.csv")
```

```{r}
# Print every column with its type and the first few values
acled_data %>%
  glimpse()
```

Upon first "glimpse" of our csv file, there are 582,032 rows and 31 columns in the data set!

So for this example of what the str_detect function does, we will be looking at immigration-related events from 2020 to 2022, tracked by the ACLED.

## Str_detect -- what is it?

![](JOHN.gif)

The str_detect function of the tidyverse is a way for you to find keywords within your data set. You give it a text column and a pattern (a specific set of letters, a word or a phrase), and it answers TRUE or FALSE for every row depending on whether the pattern appears in that row's text. For this example we are using the ACLED data set, which contains a variable called "notes".
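
To see what str_detect gives back on its own, here is a minimal sketch. The three sentences in `example_notes` are made up for illustration and are not taken from the ACLED data.

```{r}
library(stringr)

# Made-up example notes, not taken from the ACLED data
example_notes <- c(
  "Protesters gathered at the border crossing.",
  "Workers rallied for higher wages.",
  "A march against new immigration rules."
)

# str_detect() returns one TRUE/FALSE per element of the text vector
has_border <- str_detect(example_notes, "border")
has_border
# TRUE FALSE FALSE
```

Inside `filter`, those TRUE/FALSE values decide which rows are kept.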

The "notes" column contains a written narrative of each event the organization has tracked. For this example, focused on immigration-related events from 2020 to 2022, we are going to write a str_detect command that finds the ACLED events mentioning immigration or "border" in the notes column.

The code chunk also filters the events down to the United States, and uses the "select" function to display only the columns listed.

The code chunk below should, in theory, turn up any mention of "immigration" or "border" within the notes column of the ACLED data set. The pattern is written as `immigra|border`. The `|` (divider line) means "or", so a row matches if its notes contain either the beginning of "immigration" (which also catches "immigrant" and "immigrants") or the word "border". It does NOT search for the two as one run-on phrase.

```{r}
acled_data %>%
  # Keep United States events whose notes mention "immigra" or "border"
  filter(country %in% c("United States"),
         str_detect(notes, "immigra|border")) %>%
  # Show only the columns needed to read each event
  select(data_id, notes, event_date, country, region,
         event_type, actor1:assoc_actor_2, location, source, admin2, admin3)
```

Okay! Here's our first example of how str_detect works! We went from 582,032 rows to only about 719 rows when we narrowed our search of the data frame down to United States events that mention immigration or border within the "notes" column of the data!
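
One thing worth checking before trusting a count like this: str_detect is case-sensitive by default, so a lowercase pattern will skip text such as "Immigration" at the start of a sentence. The sketch below uses invented sentences to show the difference, and uses stringr's `regex()` helper with `ignore_case = TRUE` as one possible way to catch both. Whether this changes the ACLED result is something you would have to test on the real data.

```{r}
library(stringr)

# Made-up notes for illustration only
example_notes <- c(
  "An immigrant rights group held a vigil.",
  "Immigration activists blocked a road.",
  "A rally near the Border Patrol station.",
  "Teachers went on strike."
)

# "|" means "or": match "immigra" or "border", exactly as typed
case_sensitive <- str_detect(example_notes, "immigra|border")
case_sensitive
# TRUE FALSE FALSE FALSE

# regex(..., ignore_case = TRUE) also catches "Immigration" and "Border"
case_insensitive <- str_detect(example_notes,
                               regex("immigra|border", ignore_case = TRUE))
case_insensitive
# TRUE TRUE TRUE FALSE
```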

## But...What else??

So, finding 719 immigration-related events tracked by the ACLED wasn't enough? Okay...what about something more topical?

On February 24th, 2022, Vladimir Putin ordered Russian military forces into Ukraine, in an act that is widely being called an unprovoked invasion of a sovereign country. Like in decades before, Russia's dicey diplomatic agenda has spurred mass outrage, and there have subsequently been events tracked by the ACLED that were spurred on by this invasion.

Below we are going to use our str_detect function to find out how many of the United States events the group tracks have been about the Russia-Ukraine war.

So you set up the code chunk in the same way, except you replace `immigra|border` with `Ukraine`, still searching the notes column, to find a rough estimate of the number of events related to Ukraine.

```{r}
acled_data %>%
  filter(country %in% c("United States"),
         str_detect(notes, "Ukraine")) %>%
  # Tally the matching events for each inter1 code, largest group first
  count(inter1, sort = TRUE, name = "uniq_inter1")
```

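P.S. We added a "count" function to tally the events that have "Ukraine" in the notes narrative. As written, it counts by `inter1`, so the result has one row per `inter1` code, with the number of matching events in the `uniq_inter1` column. This should give us the same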
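
To make the difference between a per-group tally and one overall total concrete, here is a small sketch. The `toy_events` table and its values are invented for illustration and only borrow the column names `inter1` and `notes` from the ACLED data.

```{r}
library(dplyr)
library(stringr)

# Hypothetical mini data set; every value here is invented
toy_events <- tibble(
  inter1 = c(6, 6, 5, 6),
  notes = c(
    "Rally in support of Ukraine.",
    "Vigil for Ukraine at city hall.",
    "Clash outside a Ukraine solidarity march.",
    "Strike over pay."
  )
)

ukraine_events <- toy_events %>%
  filter(str_detect(notes, "Ukraine"))

# One row per inter1 code, with the number of matching events in n
by_code <- ukraine_events %>% count(inter1, sort = TRUE)
by_code

# A single overall total of matching events
total <- ukraine_events %>% count()
total
```

Adding up the `n` column of the per-code table gives the same number as the overall total.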

## Where to go from here

So you've learned the basics of str_detect! Where do you go from here?

Well, the str_detect function has a wide range of uses beyond free-text narratives like "notes". It works on any text column, so you can also query names, categories, dates stored as text and other labels.
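
As one sketch of that idea: in this data set the `event_date` column is stored as text such as "18 March 2022", so the same function can pick out a month or a year. The dates below are invented and only copy that format.

```{r}
library(stringr)

# Invented dates, written in the same text format as event_date
example_dates <- c("18 March 2022", "02 April 2022",
                   "24 February 2022", "18 March 2021")

# Which dates fall in March 2022?
in_march_2022 <- str_detect(example_dates, "March 2022")
in_march_2022
# TRUE FALSE FALSE FALSE

# "$" anchors the match to the end of the text, here the year
in_2022 <- str_detect(example_dates, "2022$")
in_2022
# TRUE TRUE TRUE FALSE
```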

In the ACLED data, there are numerous variables that can hold a wealth of information for data analysis or story ideas. One of the ways str_detect

For instance, what if we wanted to find only events from a certain organizing group? First we would need the name of the organization we are searching for. Let's use "Veterans on Patrol". Here's a trick: putting a `^` at the start of the pattern anchors it to the beginning of the text, so the query only shows actors whose names start with "Vet". Check it out below:

```{r}
acled_data %>%
  filter( country %in% c("United States") ,
          str_detect( actor1 , "^Vet"))
```

So you can see from the table above that we ran into a problem, and it's a common one with str_detect: there are a ton of different actors in this column that start with "Vet". So let's narrow it down. If you search the data set, you'll see that the Veterans on Patrol entries always start with "VOP:", so let's make that our query. See it below:

```{r}
acled_data %>%
  filter( country %in% c("United States") ,
          str_detect( actor1 , "^VOP:"))
```

And when you use that code line, you see that there are 57 events attributed to VOP: Veterans on Patrol. Note that this query runs against all United States events in the data, not just the 719 immigration-related ones from earlier. This is a simple way to query your data, especially when you've narrowed a potential story subject down and want to see only those events. Likewise, this can be used to search loan data (perhaps the PPP loans from early in the pandemic) or even law enforcement data, to find where the hot spots in police departments are.

This is just the beginning of what this function is capable of. But now, you're ready to rock n' roll.
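
To see exactly what the `^` changes, here is a short sketch. Apart from "VOP: Veterans on Patrol", the actor labels are invented for illustration.

```{r}
library(stringr)

# Invented actor labels; only "VOP: Veterans on Patrol" comes from the text above
example_actors <- c(
  "VOP: Veterans on Patrol",
  "Veterans Group (United States)",
  "Protesters (United States)",
  "Group backed by VOP: Veterans on Patrol"
)

# Without "^", the pattern can match anywhere in the text
anywhere <- str_detect(example_actors, "VOP:")
anywhere
# TRUE FALSE FALSE TRUE

# With "^", the pattern must match at the very start of the text
at_start <- str_detect(example_actors, "^VOP:")
at_start
# TRUE FALSE FALSE FALSE

# "^Vet" only matches labels that begin with "Vet"
starts_with_vet <- str_detect(example_actors, "^Vet")
starts_with_vet
# FALSE TRUE FALSE FALSE
```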

I hope this has helped introduce you to the str_detect function in the "tidyverse" in R!

![](mick.gif)
